Web Searching, Sleuthing and Sifting: Lesson 3: What's next? (Search Engines and Web Indexes)

Web Searching, Sleuthing and Sifting

Lesson Three:
What's next? (Search Engines and Web Indexes)

In this lesson we will discuss how search engines work in general terms; we will not cover every possible scenario (or search algorithm!).

What is a search engine really and how does it work?

What we think of as a search engine is really a team effort. There are three "members" of the team: a mechanism that identifies web pages to be included in the database, a mechanism that indexes those pages, and a searching mechanism with an interface, which scans the index for keywords. Users search the index (and hence, the database or web documents) through a query box or a template. Documents in which the search terms occur are presented as "hits."

Although some facilities are beginning to incorporate "natural language" searching (searching by asking a question such as "Where are the doughnuts?"), most search tools retrieve "hits" or "matches" by attempting to match your search terms (converted to a "string" of data bits) against their index. Because the terms are matched as literal strings, the search engine must somehow be instructed to include plurals and alternate forms of a term. Note: although some search tools automatically include plurals, many do not. If you are interested in dogs, search for "dog or dogs."
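To make the point about plurals concrete, here is a minimal Python sketch of literal term matching. The three sample documents and the function name exact_hits are made up for illustration; a real search tool is far more elaborate.

```python
# Hypothetical sample documents, invented for this illustration.
DOCS = ["my dog barks", "dogs need walks", "cats sleep all day"]

def exact_hits(docs, terms):
    """Return the documents containing at least one term as an exact word."""
    wanted = {term.lower() for term in terms}
    return [doc for doc in docs if wanted & set(doc.lower().split())]

singular_only = exact_hits(DOCS, ["dog"])
with_plural = exact_hits(DOCS, ["dog", "dogs"])

print(singular_only)  # ['my dog barks']
print(with_plural)    # ['my dog barks', 'dogs need walks']
```

Searching for "dog" alone misses the document about "dogs," which is why it pays to supply both forms when the tool does not add plurals for you.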

What's a 'bot?

A 'bot, otherwise known as an intelligent agent, spider, crawler, robot, or worm, is an automated device (software) which may be programmed to search for terms ("strings") matching certain criteria. In terms of web search engines, a 'bot identifies and notes the URLs of web pages to be included in the database. Later, another 'bot comes along and works on the interiors of the web documents, recording occurrences of words and their positions within the text. This information is used to create a huge index. 'Bots travel along the links of a web site; that is, they crawl or traverse from one hypertext link to another.
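The link-following behavior can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny "web" below is a hand-written dictionary (no real pages are fetched), the example.com addresses are placeholders, and the breadth-first order is one reasonable choice rather than how any particular 'bot works.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to the links found on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def crawl(start, links):
    """Follow links outward from start; return URLs in the order discovered."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:  # skip pages already noted
                seen.append(target)
                queue.append(target)
    return seen

found = crawl("http://example.com/", LINKS)
print(found)
```

Starting from the home page, the sketch notes all four addresses once each, even though some pages link back to pages already visited.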

What's the index for?

The index is how the search engine locates the URLs which match your request. The web documents containing the query keywords are presented as a listing, which may include a brief summary of each site. A simple way to understand the index is to think of it as a computerized book index. To discover where a topic occurs in a book, we would look up the word in the index, which would indicate the page number(s) where the term occurs. Now imagine that every single word is included in the book index. A computerized version might be represented like this:

Keyword	Number of times keyword occurs in book	Position(s) in book of keyword	Page number(s)
Apple	175	title page, page 1: first paragraph word #5, page 2: first paragraph word #20, second paragraph word #15, page 5: 2nd paragraph word 21,...etc. etc., in summary	title page, table of contents, pages: 1,2,5 etc.,12,25.
Orange	22	table of contents, page 3, first paragraph word #3, page 17; first paragraph word #30, page 21 etc.	table of contents, pages 3,17,21 etc.
Grape	3	page 50, 2nd paragraph, word #18, page 52, 1st paragraph word #41, page 53, 1st paragraph word #4	pages 50, 52, 53

Some immediate observations might include:

a) the word apple occurs a lot in the book
b) the word apple occurs in the title
c) the words apple and orange occur in the table of contents
d) the word grape does not occur in the title or table of contents.

A search engine uses its index to retrieve web documents in which your search terms occur. The index lists the term and where it occurs (the URL or address of the web page), much like a book index. Remember: a search engine returns hits only from its own database, that is, web pages that it has indexed. So, if the site you are looking for has not yet been indexed, it won't be retrieved in a listing no matter how magnificent your search strategy or statement.
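For readers who like to see the idea in code, the book-index analogy can be sketched in Python as a dictionary from each word to the pages and word positions where it occurs. The two sample pages and the name build_index are assumptions made up for this sketch; it is not how any specific search engine stores its index.

```python
from collections import defaultdict

# Hypothetical pages standing in for indexed web documents.
PAGES = {
    "http://example.com/apples": "apple pie and apple cider",
    "http://example.com/fruit": "orange and apple and grape",
}

def build_index(pages):
    """Map each word to {url: [positions of the word within that page]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(position)
    return index

index = build_index(PAGES)

print(index["apple"])
# {'http://example.com/apples': [0, 3], 'http://example.com/fruit': [2]}
print(index.get("banana", {}))
# {}
```

The empty result for "banana" illustrates the reminder above: a term (or page) that was never indexed cannot be returned, however well the query is phrased.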

How does a search engine decide how to list web sites matching my search terms?

Each search engine uses a different algorithm or method to calculate something called "relevance," which it then "ranks." Have you ever noticed the numbers which sometimes appear next to the URLs in a listing of search results? This is the "relevance ranking." Relevance means the probability that the "hit" or "match" is on-target with your query. The creators of search engines change the way they calculate relevance and do not tell us mere users their methodology; being high in the major search engines' rankings on a topic means big business. Unscrupulous folks "spam" the search engine to try to improve their rankings (and hence, their web-based business). So exactly how a search engine calculates relevance is protected, proprietary information.

Note: because each search engine assigns relevancy rankings differently, if you execute exactly the same search in several search engines, you will get different results in terms of how and where the URLs are listed (even if their database contents were identical).

In general, however, relevance is calculated by noting where the term occurs within the text and assigning this position a "weight" or level of importance. Terms occurring in the title, summary, in key positions within a paragraph or appearing several times within a paragraph usually carry more "weight" because there is a higher probability that terms in these positions indicate significant material on the topic.

This is very similar to our book index example above: because the term apple occurs many times and in key positions (title, table of contents, beginning of paragraphs), there is a high probability that the document contains significant information about apples. Note that orange also occurs in the table of contents, an indication of the term's relative importance (it is a significant topic, but not as important as apple). The search engine's algorithm, and the methodology it uses to calculate relevance, emulates the observations and judgments we make based on our experience. A search engine will return our book as a hit when the search terms apple and grape are requested, whereas a human might judge that although the two terms occur within the document, there is no significant relationship between them, and that the hit is therefore irrelevant.
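Since the real formulas are proprietary, the following Python sketch is only a toy version of position-based weighting. The weights (5 for a title hit, 1 for a body hit), the three sample documents, and the function names are all invented for illustration.

```python
# Hypothetical weights: a hit in the title counts more than a hit in the body.
TITLE_WEIGHT = 5
BODY_WEIGHT = 1

# Hypothetical documents, invented for this illustration.
DOCS = [
    {"url": "http://example.com/apples",
     "title": "Apple growing guide",
     "body": "how to grow an apple tree"},
    {"url": "http://example.com/fruit",
     "title": "Fruit salad",
     "body": "mix apple orange and grape"},
    {"url": "http://example.com/cars",
     "title": "Used cars",
     "body": "cars for sale"},
]

def score(doc, term):
    """Weighted count of how often term appears in the title and the body."""
    term = term.lower()
    title_hits = doc["title"].lower().split().count(term)
    body_hits = doc["body"].lower().split().count(term)
    return TITLE_WEIGHT * title_hits + BODY_WEIGHT * body_hits

def rank(docs, term):
    """Return (score, url) pairs for matching documents, best first."""
    hits = [(score(doc, term), doc["url"]) for doc in docs]
    hits = [pair for pair in hits if pair[0] > 0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)

ranking = rank(DOCS, "apple")
print(ranking)
# [(6, 'http://example.com/apples'), (1, 'http://example.com/fruit')]
```

The page with apple in its title outranks the page that merely mentions it once, and the page that never mentions it is not listed at all. Change the weights and the order can change, which is one reason different search engines list the same pages differently.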

Some search engines look only in certain fields to index documents, such as the title field, the first paragraph, and something called "meta-tags." Meta-tags allow the creator of a web site to add descriptive keywords which are not displayed in the actual web documents; they exist specifically to enhance retrieval of the document. As people "spam" the search engine (for example, by repeating terms over and over again), meta-tags are decreasing in importance because the folks who program the 'bots train them to overlook repetitions and other clues to "spamming."

What's the best search engine?

I'm sure I'm going to disappoint a lot of folks by giving the answer "the best search engine is the one that fits the task." Until you have some experience with knowledge-seeking tools and, importantly, with identifying your real information need, it may be difficult to ascertain which tool is best for your purpose. (For example, a query on "Leonardo da Vinci's Mona Lisa" is likely to be more successful than "that lady with the smile by a Renaissance artist," and "dosage and usage guidelines for St. John's Wort" more successful than plain "St. John's Wort.") But the good news is, you will make better choices with experience.

What do I use? Well, that depends....

Remember, I am a librarian in an academic (college) library, so I never know what the next information request will be (that's the fun part!). But this means in practical terms that I am looking for information in a variety of places, which precludes having a standard game plan. Here are a few of my search tactics/favorite tools:

for general use, I use Altavista (http://www.altavista.com). It's fast, returns good hits and is accurate. Plus its database is huge (alternates with Hotbot as the largest web database). Altavista also has a nice refine feature for weeding out irrelevant hits.
for searching by domain (.edu, .com, .gov) I use Hotbot (http://www.hotbot.com). I also use Hotbot for field searching since it has a nice template in its "Advanced Search."
for specific subjects (rather than a specific query), I might use a specialized directory or search engine, particularly in the Arts, Education and Health.
I tend to stay away from meta-search engines (which search multiple search engines at once) because they strip away my Boolean or field commands. I would however, recommend them for general searches where advanced searching techniques will not be used.
if I want to group my hits by related topics, I use NorthernLight (http://www.northernlight.com).
if I want to use concept searching, (find a good web site and then look for others using the same criteria) I use Excite (http://www.excite.com).
frequently I will change tactics in mid search -- if I get too many hits, I'll weed a few out. If I do not find anything relevant, I'll switch to a different source and/or modify my "search statement" or keywords.

More search tips in Lesson 4!

What are simple ways to make my search more effective?

A very effective way to increase the relevance or precision of "hits" is to search as a phrase. In most cases this simply means putting quotation marks around the search terms. "Red socks" is a different search than red socks in most search engines. What you are actually doing by searching as a phrase is using the concept of proximity, which concerns the terms' physical closeness to one another. A document in which red and socks occur close or next to each other is more likely to be on target than a document with red in the title and socks buried in the text.
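Here is a small Python sketch of the difference between a phrase search and a plain keyword search, using word positions to test for adjacency. The two sample sentences and the function names are invented for illustration, and real search tools may define "phrase" and "proximity" differently.

```python
def word_positions(text):
    """Map each word in text to the list of positions where it occurs."""
    positions = {}
    for i, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(i)
    return positions

def contains_all(text, query):
    """Keyword search: every query word occurs somewhere in text."""
    positions = word_positions(text)
    return all(word in positions for word in query.lower().split())

def contains_phrase(text, phrase):
    """Phrase search: the words of phrase occur side by side, in order."""
    positions = word_positions(text)
    words = phrase.lower().split()
    if not words:
        return False
    for start in positions.get(words[0], []):
        if all(start + offset in positions.get(word, [])
               for offset, word in enumerate(words)):
            return True
    return False

# Hypothetical documents, invented for this illustration.
doc_a = "she wore red socks to the game"
doc_b = "red barn for sale also wool socks"

print(contains_all(doc_a, "red socks"), contains_all(doc_b, "red socks"))
# True True
print(contains_phrase(doc_a, "red socks"), contains_phrase(doc_b, "red socks"))
# True False
```

Both documents match the loose keyword search, but only the first matches the phrase, which is the gain in precision described above.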

Another way to increase your search effectiveness is to be as specific as possible; that is, include as many terms and synonyms as you can think of to fully describe your topic. Instead of

women and computers

try

(woman or women) and (technology or computer) and (training or professional development) and (barriers or problems)

Note: search utilities may not support the use of parentheses or nesting in basic searches, although many support them in their "advanced" searches.
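A nested query like the one above can be read as "at least one word from each parenthesized group." The Python sketch below evaluates it that way. The four sample documents are invented, the function names are assumptions, and the phrase "professional development" is simplified to the single word "development" to keep the example short.

```python
# Hypothetical documents, invented for this illustration.
DOCS = {
    "doc1": "women face barriers in computer training",
    "doc2": "a woman discusses technology problems",
    "doc3": "women and computers",
    "doc4": "professional development barriers for men in technology",
}

# Each set is one parenthesized OR-group; the groups are ANDed together.
QUERY = [
    {"woman", "women"},
    {"technology", "computer"},
    {"training", "development"},  # simplification of "professional development"
    {"barriers", "problems"},
]

def matches(text, groups):
    """True if text contains at least one word from every group."""
    words = set(text.lower().split())
    return all(words & group for group in groups)

def search(docs, groups):
    """Return the names of the matching documents, sorted alphabetically."""
    return sorted(name for name, text in docs.items() if matches(text, groups))

results = search(DOCS, QUERY)
print(results)  # ['doc1']
```

Only the document that touches every facet of the topic survives, so the more specific statement trades a long list of loose hits for a short list of on-target ones.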

So to recap, phrase searching and specificity are two simple ways to increase precision in searching.

What are the most popular and useful search utilities? (the "major" search engines)

Ok folks. We are looking at a sampling of search engines and describing generalities; we are not attempting to create a definitive listing. For example, we'll be discussing meta search engines in Lesson 6, so you won't find them listed here.

Alta Vista (http://www.altavista.com)

very

Excite (http://www.excite.com)

concept

HotBot (http://www.hotbot.com)

huge

There are more "major search engines" for you to evaluate in the Assignments.

Specialized Search Engines and Collections:

Specialized search engines are most often programmed to "collect" web documents along a topical theme, for example in the Arts, Science, or Health-related topics, or even in more specialized subjects such as Ancient History of the Mediterranean.

Also fitting in this category are "search tools" that really calculate rather than retrieve information (such as those fitting in the "distance between two points" or "salary differential" categories). Since it is impossible to list specific tools here, the following are sites which group or list subject specific search engines or tools:

All-in-One Search Page (http://www.allonesearch.com/)
Beaucoup (http://www.beaucoup.com)
FinderSeeker: The Search Engine for Search Engines (http://www.hamrad.com/search.html)
Internet Sleuth (http://www.isleuth.com/)

For more information:

Maze, Susan, Moxley, David and Smith, Donna, Authoritative Guide to Web Search Engines, Neal-Schuman, 1997 (Book)

Internet Tutorials - University at Albany Libraries -- Search Engines and Subject Directories by Laura Cohen (http://www.albany.edu/library/internet/#search)
The Spider's Apprentice: How to Use Web Search Engines, by Monash Associates

Search Engine Watch, Danny Sullivan, Editor

Searching the World Wide Web: Strategies, Analyzing your topic, Choosing search tools

Assignments

This week we are going on an Infoquest! Please find answers to the following questions using either a subject directory that we discussed in Lesson 2, or a search engine.

Remember -- there are many routes to the same information....

Where can I find information about the wreck of the Titanic? (hint: maritime history)
Where can I find a site telling me about the Chinese New Year? (hint: seasonal)
Where can I find directions to my house? (hint: use a map)
What is ICQ? (hint: Internet term)
Where can I find statistics on new home owners in Kentucky? (hint: .gov)

Next,

Evaluate the following "major" search engines:

Infoseek (http://www.infoseek.com)
Lycos (http://www.lycos.com)
Northern Light (http://www.northernlight.com)
Webcrawler (http://www.webcrawler.com)

Consider the following criteria:

how large is the database?
how frequently is the database updated?
how accurate were your search results?
how easy is the interface to use?
what advanced search facilities are available?

Next:

Find a search facility that will help you find the following types of information. Please include a sample question/reason for inquiry.

a picture of a school bus
someone's email address
the zip code of your best friend
a recipe you saw posted on a newsgroup (not listserv!)
the latest world news
a computerized calendar (software)

https://www.angelfire.com/in/virtuallibrarian/lesson3.html
Last updated: February 24, 1999, Links checked: February 24, 1999
